Predictive Maintenance

Based on the Kaggle dataset:
https://www.kaggle.com/ludobenistant/predictive-maintenance-1/data

A step-by-step walkthrough with pandas DataFrames, followed by a high-level view of what Dask can offer, and finally a neural-network treatment of the data.


In [1]:
# Import Libraries needed
import pandas as pd                 #dataframe manipulation
import numpy as np                  #numerical processing of vectors
import matplotlib.pyplot as plt     #plotting
%matplotlib inline                

#import tensorflow as tf
import sklearn
from sklearn import tree
import graphviz
import dask


print("Pandas:\t\t", pd.__version__)
print("Numpy:\t\t", np.__version__)
#print("Tensorflow:\t", tf.__version__)
print("Dask:\t\t", dask.__version__)
print("Scikit-learn:\t", sklearn.__version__)


Pandas:		 0.22.0
Numpy:		 1.14.0
Dask:		 0.16.0
Scikit-learn:	 0.19.1

In [2]:
# NOTE: df_init and df reference the same DataFrame object;
# use .copy() if a pristine copy unaffected by later edits is needed
df_init = df = pd.read_csv('./maintenance_data.csv')

In [3]:
df.columns


Out[3]:
Index(['lifetime', 'broken', 'pressureInd', 'moistureInd', 'temperatureInd',
       'team', 'provider'],
      dtype='object')

In [4]:
df.head()


Out[4]:
lifetime broken pressureInd moistureInd temperatureInd team provider
0 56 0 92.178854 104.230204 96.517159 TeamA Provider4
1 81 1 72.075938 103.065701 87.271062 TeamC Provider4
2 60 0 96.272254 77.801376 112.196170 TeamA Provider1
3 86 1 94.406461 108.493608 72.025374 TeamC Provider2
4 34 0 97.752899 99.413492 103.756271 TeamB Provider1

In [5]:
df.describe()


Out[5]:
lifetime broken pressureInd moistureInd temperatureInd
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 55.195000 0.397000 98.599338 99.376723 100.628541
std 26.472737 0.489521 19.964052 9.988726 19.633060
min 1.000000 0.000000 33.481917 58.547301 42.279598
25% 34.000000 0.000000 85.558076 92.771764 87.676913
50% 60.000000 0.000000 97.216997 99.433959 100.592277
75% 80.000000 1.000000 112.253190 106.120762 113.662885
max 93.000000 1.000000 173.282541 128.595038 172.544140

In [6]:
df.sort_values(by='lifetime', ascending=True).head()


Out[6]:
lifetime broken pressureInd moistureInd temperatureInd team provider
52 1 0 72.423129 120.484155 116.400433 TeamB Provider2
205 1 0 92.100802 89.994611 97.292290 TeamB Provider1
464 1 0 141.042595 113.358822 91.310554 TeamC Provider2
71 1 0 107.785993 117.390336 113.377308 TeamB Provider4
209 1 0 104.658530 107.537666 126.374317 TeamB Provider1

In [7]:
df.sort_values(by='lifetime', ascending=True).tail()


Out[7]:
lifetime broken pressureInd moistureInd temperatureInd team provider
283 93 1 79.639584 115.057383 87.223725 TeamB Provider2
746 93 1 81.289168 105.856819 80.757421 TeamB Provider2
106 93 1 109.671138 104.600332 102.881957 TeamB Provider2
154 93 1 129.585306 93.692108 84.915725 TeamA Provider2
649 93 1 92.976675 98.641371 101.332116 TeamA Provider2

In [8]:
# Sorting the two columns independently would pair unrelated rows,
# so sort the whole frame once and plot the aligned columns
df_sorted = df.sort_values('team')
plt.bar(df_sorted.team, df_sorted.lifetime)


Out[8]:
<Container object of 1000 artists>
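
Plotting one bar per row overdraws 1,000 bars on three team labels. A grouped summary is usually more readable; here is a minimal sketch of mean lifetime per team, using a hypothetical mini-frame in place of the real data:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt

# Hypothetical mini-frame standing in for maintenance_data.csv
df = pd.DataFrame({
    'team':     ['TeamA', 'TeamB', 'TeamA', 'TeamC', 'TeamB'],
    'lifetime': [56, 81, 60, 86, 34],
})

# One bar per team: the mean lifetime of its machines
mean_life = df.groupby('team')['lifetime'].mean()
ax = mean_life.plot.bar(title='Mean lifetime per team')
ax.set_ylabel('lifetime')
plt.tight_layout()
```

Any per-group aggregate (`median`, `max`, a broken-rate) can be swapped in for `mean` the same way.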

In [9]:
df.groupby([df.team, df.broken]).count()


Out[9]:
lifetime pressureInd moistureInd temperatureInd provider
team broken
TeamA 0 213 213 213 213 213
1 123 123 123 123 123
TeamB 0 206 206 206 206 206
1 150 150 150 150 150
TeamC 0 184 184 184 184 184
1 124 124 124 124 124

In [10]:
df.groupby(['team','broken']).agg({'broken': 'count'}).apply(lambda x:100 * x / float(x.sum()))


Out[10]:
broken
team broken
TeamA 0 21.3
1 12.3
TeamB 0 20.6
1 15.0
TeamC 0 18.4
1 12.4

In [11]:
show_perc = df.groupby(['team','broken']).agg({'broken': 'count'})
show_perc.apply(lambda x:100 * x / float(x.sum()))


Out[11]:
broken
team broken
TeamA 0 21.3
1 12.3
TeamB 0 20.6
1 15.0
TeamC 0 18.4
1 12.4
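
Note that the percentages above are relative to all 1,000 machines, so they sum to 100 across the whole table, not within each team. If a per-team breakdown is wanted instead, one sketch (on a hypothetical mini-frame with the same column names) is to normalise within each `team` group:

```python
import pandas as pd

# Hypothetical mini-frame standing in for maintenance_data.csv
df = pd.DataFrame({
    'team':   ['TeamA', 'TeamA', 'TeamA', 'TeamB', 'TeamB'],
    'broken': [0, 0, 1, 1, 1],
})

# Percentage broken/operational *within* each team
per_team = (df.groupby(['team', 'broken']).size()
              .groupby(level='team')
              .transform(lambda x: 100 * x / x.sum()))
print(per_team)
```

Here each team's rows sum to 100 on their own, which is usually what a "which team breaks more machines" question is after.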

In [12]:
column = 'provider'
show_perc = df.loc[df['broken'] == 1].groupby([column]).agg({'broken': 'count'})
show_perc.apply(lambda x:round(100 * x / float(x.sum()),2)).rename(columns={"broken": "%"})


Out[12]:
%
provider
Provider1 29.22
Provider2 22.92
Provider3 28.72
Provider4 19.14
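
The same provider breakdown can be written in one line with `value_counts(normalize=True)`; a sketch on a hypothetical mini-frame:

```python
import pandas as pd

# Hypothetical mini-frame standing in for maintenance_data.csv
df = pd.DataFrame({
    'broken':   [1, 1, 1, 0, 1],
    'provider': ['Provider1', 'Provider2', 'Provider1', 'Provider3', 'Provider4'],
})

# Share of broken machines per provider, as percentages
pct = df.loc[df['broken'] == 1, 'provider'].value_counts(normalize=True) * 100
print(pct.round(2))
```

`value_counts` only reports providers that appear among the broken machines, so a provider with zero failures drops out of the result.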

Decision Tree Classification of existing Data with Scikit-learn

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
http://scikit-learn.org/stable/modules/tree.html


In [13]:
tree_data = df_init.drop(columns=['broken'])
tree_target = df_init.broken

#workaround replacement strings to integers - DO NOT DO IT LIKE THIS ;-)
try:
    tree_data.replace('TeamA',1, inplace=True)
    tree_data.replace('TeamB',2, inplace=True)
    tree_data.replace('TeamC',3, inplace=True)
    tree_data.replace('Provider1',1, inplace=True)
    tree_data.replace('Provider2',2, inplace=True)
    tree_data.replace('Provider3',3, inplace=True)
    tree_data.replace('Provider4',4, inplace=True)
except:
    pass  

#convert dataframes to arrays
tree_data = tree_data.values
tree_target = tree_target.values
#column names - labels
tree_feature_names = ['lifetime', 'pressureInd', 'moistureInd', 'temperatureInd', 'team', 'provider']
#target names - class; must follow ascending class order (0 = operational, 1 = broken)
tree_target_names = ['Operational', 'BROKEN!']

#Tree Classifiers
tree_clf = tree.DecisionTreeClassifier()

#tree_clf.set_params(max_depth=2)

tree_clf = tree_clf.fit(tree_data, tree_target)

tree_clf.get_params()


Out[13]:
{'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': False,
 'random_state': None,
 'splitter': 'best'}
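
A cleaner alternative to the string-replace workaround in In[13] is to let pandas encode the categoricals. A sketch using `Categorical` codes, on a hypothetical mini-frame (the same idea applies to `df_init`):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the team/provider columns
df = pd.DataFrame({
    'team':     ['TeamA', 'TeamC', 'TeamA', 'TeamB'],
    'provider': ['Provider4', 'Provider4', 'Provider1', 'Provider2'],
})

# Encode each categorical column as integer codes
# (0-based, assigned in sorted order of the unique values)
for col in ['team', 'provider']:
    df[col] = df[col].astype('category').cat.codes

print(df)
```

The codes are 0-based rather than the 1-based mapping used above; for a tree-based model the specific integers do not matter, only that the mapping is applied consistently at training and prediction time.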

In [14]:
#output graph tree
tree_dot_data = tree.export_graphviz(tree_clf, 
                                out_file=None, 
                                feature_names=tree_feature_names,
                                class_names=tree_target_names,
                                filled=True, 
                                rounded=True,
                                special_characters=True) 
graph = graphviz.Source(tree_dot_data) 
graph.render("Maintenance_classification_tree")
#show tree
graph


Out[14]:
[Rendered decision tree graph: the root splits on lifetime ≤ 59.5 (gini = 0.479, 1000 samples, value = [603, 397]); deeper splits use provider, team, moistureInd, pressureInd and temperatureInd, ending in pure leaves.]

For reference when reading the tree, these are the feature columns in training order:


In [15]:
df_init.drop(columns=['broken']).columns


Out[15]:
Index(['lifetime', 'pressureInd', 'moistureInd', 'temperatureInd', 'team',
       'provider'],
      dtype='object')

In [16]:
# Class prediction (hard labels, no probabilities): 1 --> BROKEN, 0 --> Operational
print('instance 1 prediction: ', tree_clf.predict([[70., 100., 100., 100., 1., 3.]]))
print('instance 2 prediction: ', tree_clf.predict([[70., 100., 100., 100., 1., 1.]]))


instance 1 prediction:  [1]
instance 2 prediction:  [0]
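
The tree above was fitted and inspected on the full dataset, so its clean-looking leaves say little about generalisation. A minimal sketch of a holdout evaluation, using synthetic placeholder data in place of the real six features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(1000, 6)              # placeholder for the six feature columns
y = (X[:, 0] > 0.6).astype(int)    # placeholder target: broken past a lifetime-like threshold

# Hold out a quarter of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print('train accuracy:', clf.score(X_train, y_train))
print('test accuracy: ', clf.score(X_test, y_test))
```

Capping `max_depth` also keeps the tree from memorising single-sample leaves like those visible in the full-data tree above.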

In [ ]: